We develop a unified approach for support vector machines for
classification and regression in which the outcomes are a function
of the survival times subject to right censoring. We present a novel
support vector regression algorithm that is adjusted for censored data.
We provide finite sample bounds on the generalization error of the 
algorithm. We prove risk consistency for a wide class of probability 
measures and study learning rates. We apply the general methodol- 
ogy to estimation of the (truncated) mean, median, quantiles, and for 
classification problems. We present a simulation study that demon- 
strates the performance of the proposed approach. 

1. Introduction. In many medical studies, the quantity of interest is the time until some event 
occurs, where typical events of interest include death and cancer remission. The time at which the 
event of interest occurs is called the failure time. Learning the failure time distribution function, 
or quantities that depend on this distribution, as a function of the medical state of the patient, 
is one of the main goals of survival analysis research. Since the medical study may end before the 
failure event occurs, and the patient may drop out of the study, data are typically subject to right 
censoring. In this case, the failure time is not known, and instead, a lower bound on the failure 
time is given. Consequently, when applying learning methods to data from such studies, one needs 
to take into account the censored nature of the observations. 

Estimation of a patient's failure time quantities, such as the expectation and the median, is usually
done under stringent assumptions on the failure time distribution function. Commonly used
distribution models include parametric models such as the Weibull distribution, and semipara- 
metric models such as proportional hazard models (see Lawless, 2003, for both). Even when less 
stringent models such as nonparametric estimation are used, it is typically assumed that the distribution
function is smooth in both time and covariates (Dabrowska, 1987; Gonzalez-Manteiga and
Cadarso-Suarez, 1994). These assumptions seem restrictive, especially when considering today's 
high-dimensional data settings. 

In this paper, we propose a support vector machine (SVM) learning method for right censored 
data. The choice of SVM is motivated by the fact that SVM learning methods are easy-to-compute 
techniques that enable estimation under weak or no assumptions on the distribution (Steinwart and
Christmann, 2008). SVM learning methods, which we review in detail in Section 2, are a collection of
algorithms that attempt to minimize the risk with respect to some loss function. An SVM learning 
method typically minimizes a regularized version of the empirical risk over some reproducing kernel 
Hilbert space (RKHS). The obtained minimizer is referred to as the SVM decision function. The 
SVM learning method is the mapping that assigns to each data set its corresponding SVM decision 
function. 

We adapt the SVM framework to right censored data as follows. First, we represent the distri- 
bution's quantity of interest as a Bayes decision function, i.e., a function that minimizes the risk 
with respect to a loss function. We then construct a data-dependent version of this loss function 
using inverse-probability-of-censoring weighting (Robins et al., 1994). We minimize a regularized 
empirical risk with respect to this data-dependent loss function to obtain an SVM decision function 
for censored data. Finally, we define the SVM learning method for censored data, or censored SVM 
learning method for short, as the mapping that assigns to every censored data set its corresponding
SVM decision function. 

Note that unlike the standard SVM decision function, the censored SVM decision function is 
obtained as the minimizer of a data-dependent loss function. In other words, for each data set, a 
different minimization loss function is defined. Moreover, minimizing the empirical risk no longer 
consists of minimizing a sum of i.i.d. observations. Consequently, different techniques are needed 
to study the theoretical properties of the censored SVM learning method. 

We prove a number of theoretical results for the proposed censored SVM learning method. We 
first prove that the censored SVM decision function is measurable and unique. We then show that 
the censored SVM learning method is a measurable learning method. We provide a probabilistic 
finite-sample bound on the difference in risk between the learned censored SVM decision function 
and the infimum risk within the RKHS. We further show that the SVM learning method is consistent
for every probability measure for which the censoring is independent of the failure time given
the covariates, and the probability that no censoring occurs is positive given the covariates. Finally, 
we compute learning rates for the censored SVM learning method. We also provide a simulation 
study that demonstrates the performance of the censored SVM learning method. Our results are
carried out under some conditions on the approximation RKHS and the loss function, which can be 
easily verified. We also assume that the estimation of censoring probability at the observed points 
is consistent. 

One drawback of the proposed approach is the need to estimate the censoring probability at 
observed points. This estimation is required in order to use inverse-probability-of-censoring weight- 
ing for constructing the data-dependent loss function. We remark that in many applications it is 
reasonable to assume that the censoring mechanism is simpler than the failure-time distribution; 
in these cases, estimation of the censoring distribution is typically easier than estimation of the
failure distribution. For example, the censoring may depend only on a subset of the covariates, or 
may be independent of the covariates; in the latter case, an efficient estimator exists. Moreover, 
when the only source of censoring is administrative, in other words, when the data is censored 
because the study ends at a prespecified time, the censoring distribution is often known to be in- 
dependent of the covariates. The results presented in this paper hold for any censoring estimation 
technique. We present results for both correctly specified and misspecified censoring models. We 
also discuss in detail the special cases of the Kaplan-Meier and the Cox model estimators (Fleming
and Harrington, 1991). 

While the main contribution of this paper is the proposed censored SVM learning method and 
the study of its properties, an additional contribution is the development of a machine learning 
framework for right censored data. The principles and definitions that we discuss in the context 
of right censored data, such as learning methods, measurability, consistency, and learning rates,
are independent of the proposed SVM learning method. This framework can be adapted to other 
learning methods for right censored data, as well as for learning methods for other missing data 
mechanisms. 

Other learning algorithms have been suggested for survival data. Biganzoli et al. (1998) and 
Ripley and Ripley (2001) used neural networks. Johnson et al. (2004), Shivaswamy et al. (2007),
Shim and Hwang (2009), and Zhao et al. (2011), among others, suggested different versions of 
SVM. As far as we know, the theoretical properties of these algorithms have never been studied. In 
the context of multistage decision problems, Goldberg and Kosorok (2012) proposed a Q-learning 
algorithm for right censored data for which a theoretical justification is given. However, the algo- 
rithm discussed therein is not an SVM learning method, and it is assumed that the censoring is 
independent of both failure time and covariates. We believe that this work is an important step in 
developing methodology for learning from survival data.

The paper is organized as follows. In Section 2 we review right-censored data and SVM learning 
methods. In Section 3 we discuss the use of SVM for survival data when no censoring is
present. Section 4 discusses the difficulties that arise when applying SVM to right censored data and 
proposes a censored SVM learning method. Section 5 contains the main theoretical results, including 
finite sample bounds and consistency. Simulations appear in Section 6. Concluding remarks appear 
in Section 7. The lengthier proofs are provided in the Appendix. 

2. Preliminaries. In this section we establish the notation used throughout the paper. We 
begin by introducing right censored data (Section 2.1). We then discuss loss functions (Section 2.2). 
Finally we discuss SVM learning methods (Section 2.3). For right censored data we follow Fleming 
and Harrington (1991) (hereafter abbreviated FH91). For the loss function and the SVM definitions 
we follow Steinwart and Christmann (2008) (hereafter abbreviated SC08).

2.1. Right Censored Data. We assume that data consist of $n$ independent and identically
distributed random triplets $D = \{(Z_1, U_1, \delta_1), \ldots, (Z_n, U_n, \delta_n)\}$. The random vector $Z$ is a covariate
vector that takes its values in a compact set $\mathcal{Z} \subset \mathbb{R}^d$. The random variable $U$ is the observed time
defined by $U = T \wedge C$, where $T \geq 0$ is the failure time, $C$ is the censoring time, and where
$a \wedge b = \min(a, b)$. The indicator $\delta = 1\{T \leq C\}$ is the failure indicator, where $1\{A\}$ is 1 if $A$ is true
and is 0 otherwise, i.e., $\delta = 1$ whenever a failure time is observed.
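The construction of the observed triplets can be illustrated with a small simulation. This is a minimal sketch under assumed distributions: the exponential failure and censoring laws, the value of `tau`, and the function name `simulate_censored` are illustrative choices and are not taken from the paper.

```python
import random

def simulate_censored(n, seed=0):
    """Draw n triplets (z, u, delta) with u = min(t, c) and delta = 1{t <= c}."""
    rng = random.Random(seed)
    tau = 2.0  # assumed end of study; censoring times take values in [0, tau]
    data = []
    for _ in range(n):
        z = rng.uniform(0.0, 1.0)            # covariate (assumed one-dimensional)
        t = rng.expovariate(1.0 + z)         # failure time with hazard 1 + z (assumed)
        c = min(rng.expovariate(0.5), tau)   # censoring time, independent of t and z
        u = min(t, c)                        # observed time U = T ^ C
        delta = 1 if t <= c else 0           # failure indicator
        data.append((z, u, delta))
    return data

for z, u, delta in simulate_censored(5):
    print(f"z={z:.3f}  u={u:.3f}  delta={delta}")
```

Only `(z, u, delta)` is returned; the failure time `t` itself is hidden whenever `delta` is 0, which is exactly the information loss that the rest of the paper has to correct for.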

Let $S(t \mid Z) = P(T > t \mid Z)$ be the survival function of $T$, and let $G(t \mid Z) = P(C > t \mid Z)$ be the
survival function of $C$. We make the following assumptions:

(A1) $C$ takes its values in the segment $[0, \tau]$ for some finite $\tau > 0$, and $\inf_{z \in \mathcal{Z}} G(\tau- \mid z) \geq 2K > 0$.
(A2) $C$ is independent of $T$, given $Z$.

The first assumption assures that there is a positive probability of censoring over the observation
time range ($[0, \tau]$). Note that the existence of such a $\tau$ is typical since most studies have a finite
time period of observation. In the above, we also define $F(t-)$ to be the left-hand limit of a right
continuous function F with left-hand limits. The second assumption is standard in survival analysis 
and ensures that the joint nonparametric distribution of the survival and censoring times, given 
the covariates, is identifiable. 

We assume that the censoring mechanism can be described by some simple model. Below, we 
consider two possible examples, although the main results do not require any specific model. First, 
we need some notation. For every $t \in [0, \tau]$, define $N(t) = 1\{U \leq t, \delta = 0\}$ and
$Y(t) = 1\{U > t\} + 1\{U = t, \delta = 0\}$. Note that since we are interested in the survival function of the censoring variable,
$N(t)$ is the counting process for the censoring, and not for the failure events, and $Y(t)$ is the at-risk
process for observing a censoring time. For a cadlag function $A$ on $(0, \tau]$, define the product integral
$\phi(A)(t) = \prod_{0 < s \leq t} (1 + dA(s))$ (van der Vaart and Wellner, 1996). Define $\mathbb{P}_n$ to be the empirical
measure, i.e., $\mathbb{P}_n f(X) = n^{-1} \sum_{i=1}^n f(X_i)$.

Example 1. Independent censoring: Assume that $C$ is independent of both $T$ and $Z$. Define

$$\hat{\Lambda}(t) = \int_0^t \frac{\mathbb{P}_n \, dN(s)}{\mathbb{P}_n Y(s)} .$$

Then $\hat{G}_n(t) = \phi(-\hat{\Lambda})(t)$ is the Kaplan-Meier estimator for $G$. $\hat{G}_n$ is a consistent and efficient
estimator for the survival function $G$ (FH91).
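The Kaplan-Meier estimator of the censoring survival function can be sketched directly from the counting process and at-risk process defined above. The function name `km_censoring_survival` and the toy data are assumptions made for illustration. Note that censoring (`delta == 0`) plays the role of the event, and that a subject who fails at time `s` is not counted as at risk for censoring at `s`.

```python
def km_censoring_survival(times, deltas):
    """Return t -> G_hat(t), the Kaplan-Meier estimate of P(C > t)."""
    pairs = list(zip(times, deltas))
    censor_times = sorted({u for u, d in pairs if d == 0})
    steps = []  # (jump time, value of G_hat at and after that time)
    surv = 1.0
    for s in censor_times:
        n_cens = sum(1 for u, d in pairs if u == s and d == 0)
        # At-risk process from the text: U > s, or censored exactly at s.
        at_risk = sum(1 for u, d in pairs if u > s or (u == s and d == 0))
        surv *= 1.0 - n_cens / at_risk  # one factor of the product integral
        steps.append((s, surv))

    def g_hat(t):
        value = 1.0
        for s, v in steps:
            if s <= t:
                value = v
            else:
                break
        return value

    return g_hat

g_hat = km_censoring_survival([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 0])
print(g_hat(1.0), g_hat(2.0), g_hat(3.5), g_hat(4.0))
```

In the toy data the first censoring happens at time 2 with three subjects at risk, so the estimate drops from 1 to 2/3 there; the last subject is censored at time 4 while alone in the risk set, so the estimate drops to 0.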

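Example 2. The proportional hazards model: Consider the case that the hazard of $C$
given $Z$ is of the form $e^{\beta' Z} d\Lambda$ for some unknown vector $\beta \in \mathbb{R}^d$ and some continuous unknown
nondecreasing function $\Lambda$ with $\Lambda(0) = 0$ and $0 < \Lambda(\tau) < \infty$. Let $\hat{\beta}$ be the zero of the estimating
equation of the Cox model for the censoring process (the partial likelihood score equation; see FH91).

Define

$$\hat{\Lambda}(t) = \int_0^t \frac{\mathbb{P}_n \, dN(s)}{\mathbb{P}_n Y(s) e^{\hat{\beta}' Z}} .$$

Then $\hat{G}_n(t \mid z) = \phi(-e^{\hat{\beta}' z} \hat{\Lambda})(t)$ is a consistent and efficient estimator for the survival function $G$
(FH91).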

Even when no simple model for the censoring mechanism is assumed, the censoring distribution can be
estimated using a generalization of the Kaplan-Meier estimator of Example 1.

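Example 3. Generalized Kaplan-Meier: Let $K_\sigma : \mathcal{Z} \times \mathcal{Z} \mapsto \mathbb{R}$ be a kernel function of width
$\sigma$. Define $N(t, z) = K_\sigma(z, Z) 1\{U \leq t, \delta = 0\}$ and $Y(t, z) = K_\sigma(z, Z)(1\{U > t\} + 1\{U = t, \delta = 0\})$.
Define

$$\hat{\Lambda}(t \mid z) = \int_0^t \frac{\mathbb{P}_n \, dN(s, z)}{\mathbb{P}_n Y(s, z)} .$$

Then the generalized Kaplan-Meier estimator is given by $\hat{G}_n(t \mid z) = \phi(-\hat{\Lambda})(t \mid z)$, where the product
integral $\phi$ is defined for every fixed $z$. Under some conditions, Dabrowska (1987, 1989) proved
consistency of the estimator and discussed its convergence rates.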

Usually we denote the estimator of the survival function of the censoring variable $G(t \mid Z)$ by
$\hat{G}_n(t \mid Z)$ without referring to a specific estimation method. When needed, the specific estimation
method will be discussed. When independent censoring is assumed, as in Example 1, we denote
the estimator by $\hat{G}_n(t)$.

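Remark 4. By Assumption (A1), $\inf_{z \in \mathcal{Z}} G(\tau- \mid z) \geq 2K > 0$, and thus if the estimator $\hat{G}_n$ is
consistent for $G$, then, for $n$ large enough, $\inf_{z \in \mathcal{Z}} \hat{G}_n(\tau- \mid z) > K > 0$. In the following, for simplicity,
we assume that the estimator $\hat{G}_n$ is such that $\inf_{z \in \mathcal{Z}} \hat{G}_n(\tau- \mid z) \geq K > 0$. In general, one can always
replace $\hat{G}_n$ by $\hat{G}_n \vee K_n$, where $K_n \to 0$. In that case, for all $n$ large enough, $\inf_{z \in \mathcal{Z}} \hat{G}_n(\tau- \mid z) > K > 0$
and for all $n$, $\inf \hat{G}_n > 0$.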

2.2. Loss Functions. Let the input space $(\mathcal{Z}, \mathcal{A})$ be a measurable space. Let the response space
$\mathcal{Y}$ be a closed subset of $\mathbb{R}$. Let $P$ be a measure on $\mathcal{Z} \times \mathcal{Y}$.

A function $L : \mathcal{Z} \times \mathcal{Y} \times \mathbb{R} \mapsto [0, \infty)$ is called a loss function if it is measurable. We say that a loss
function $L$ is convex if $L(z, y, \cdot)$ is convex for every $z \in \mathcal{Z}$ and $y \in \mathcal{Y}$. We say that a loss function
$L$ is locally Lipschitz continuous with local Lipschitz constant function $c_L(\cdot)$ if for every $a > 0$

$$\sup_{z \in \mathcal{Z},\, y \in \mathcal{Y}} |L(z, y, s) - L(z, y, s')| \leq c_L(a) |s - s'| , \qquad s, s' \in [-a, a] .$$

We say that $L$ is Lipschitz continuous if there is a constant $c_L$ such that the above holds for any
$a$ with $c_L(a) = c_L$.

For any measurable function $f : \mathcal{Z} \mapsto \mathbb{R}$ we define the $L$-risk of $f$ with respect to the measure
$P$ as $\mathcal{R}_{L,P}(f) = E_P[L(Z, Y, f(Z))]$. We define the Bayes risk $\mathcal{R}^*_{L,P}$ with respect to the loss
function $L$ and measure $P$ as $\inf_f \mathcal{R}_{L,P}(f)$, where the infimum is taken over all measurable functions
$f : \mathcal{Z} \mapsto \mathbb{R}$. A function $f^*_{L,P}$ that achieves this infimum is called a Bayes decision function.

We present a few examples of loss functions and their respective Bayes decision functions. In the 
next section we discuss the use of these loss functions for right censored data. 

Example 5. Binary classification: Assume that $\mathcal{Y} = \{-1, 1\}$. We would like to find a
function $f : \mathcal{Z} \mapsto \{-1, 1\}$ such that for almost every $z$, $P(f(z) = Y \mid Z = z) \geq 1/2$. One can
think of $f$ as a function that predicts the label $y$ of a pair $(z, y)$ when only $z$ is observed. In
this case, the desired function is the Bayes decision function $f^*_{L,P}$ with respect to the loss function
$L_{BC}(z, y, s) = 1\{y \cdot \mathrm{sign}(s) \neq 1\}$. In practice, since the loss function $L_{BC}$ is not convex, it is usually
replaced by the hinge loss function $L_{HL}(z, y, s) = \max\{0, 1 - ys\}$.

Example 6. Expectation: Assume that $\mathcal{Y} = \mathbb{R}$. We would like to estimate the expectation of
the response $Y$ given the covariates $Z$. The conditional expectation is the Bayes decision function
$f^*_{L,P}$ with respect to the squared error loss function $L_{LS}(z, y, s) = (y - s)^2$.

Example 7. Median and quantiles: Assume that $\mathcal{Y} = \mathbb{R}$. We would like to estimate the
median of $Y \mid Z$. The conditional median is the Bayes decision function $f^*_{L,P}$ for the absolute
deviation loss function $L_{AD}(z, y, s) = |y - s|$. Similarly, the $\alpha$-quantile of $Y$ given $Z$ is obtained as
the Bayes decision function for the loss function

$$L_\alpha(z, y, s) = \begin{cases} -(1 - \alpha)(y - s) & \text{if } s \geq y \\ \alpha (y - s) & \text{if } s < y \end{cases} , \qquad \alpha \in (0, 1) .$$


Note that the functions $L_{HL}$, $L_{LS}$, $L_{AD}$, and $L_\alpha$ for $\alpha \in (0, 1)$ are all convex. Moreover, all these
functions except $L_{LS}$ are Lipschitz continuous, and $L_{LS}$ is locally Lipschitz continuous when $\mathcal{Y}$ is
compact.
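The four convex losses of Examples 5 to 7 are short enough to write out. The sketch below drops the unused covariate argument `z`, and the grid search in `empirical_minimizer` is only an illustrative way to see that the squared and absolute losses recover the sample mean and the sample median; the helper names and the toy responses are assumptions.

```python
def hinge_loss(y, s):
    return max(0.0, 1.0 - y * s)

def squared_loss(y, s):
    return (y - s) ** 2

def absolute_loss(y, s):
    return abs(y - s)

def quantile_loss(y, s, alpha):
    # -(1 - alpha)(y - s) if s >= y, and alpha (y - s) if s < y
    return (1.0 - alpha) * (s - y) if s >= y else alpha * (y - s)

ys = [1.0, 2.0, 3.0, 4.0, 100.0]          # toy responses with one outlier
grid = [i / 10 for i in range(0, 1001)]   # candidate constants s in [0, 100]

def empirical_minimizer(loss):
    """Constant s on the grid that minimizes the empirical risk of `loss`."""
    return min(grid, key=lambda s: sum(loss(y, s) for y in ys))

print(empirical_minimizer(squared_loss))    # the sample mean of ys
print(empirical_minimizer(absolute_loss))   # the sample median of ys
```

The outlier pulls the squared-loss minimizer far from the bulk of the data while the absolute-loss minimizer stays at the median, which is one practical reason to consider the Lipschitz losses.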

2.3. Support Vector Machine (SVM) Learning Methods. Let $L$ be a convex locally Lipschitz
continuous loss function. Let $H$ be a separable reproducing kernel Hilbert space (RKHS) of a
bounded measurable kernel on $\mathcal{Z}$ (for details regarding RKHS, the reader is referred to SC08,
Chapter 4).

Let $D_0 = \{(Z_1, Y_1), \ldots, (Z_n, Y_n)\}$ be a set of $n$ i.i.d. observations drawn according to the probability
measure $P$. Fix $\lambda > 0$ and let $H$ be as above. Define the empirical SVM decision function

$$f_{D_0,\lambda} = \arg\min_{f \in H} \lambda \|f\|_H^2 + \mathcal{R}_{L,D_0}(f) , \tag{1}$$

where

$$\mathcal{R}_{L,D_0}(f) \equiv \mathbb{P}_n L(Z, Y, f(Z)) \equiv \frac{1}{n} \sum_{i=1}^n L(Z_i, Y_i, f(Z_i))$$

is the empirical risk.
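To make the objective in (1) concrete, the sketch below evaluates it for a function of the form $f = \sum_j \alpha_j k(\cdot, z_j)$, for which $\|f\|_H^2 = \alpha' K \alpha$ with $K$ the Gram matrix. The Gaussian kernel, its width, the one-dimensional covariates, and the function names are assumptions made for illustration.

```python
import math

def gaussian_kernel(z1, z2, width=1.0):
    return math.exp(-((z1 - z2) ** 2) / (width ** 2))

def squared_loss(y, s):
    return (y - s) ** 2

def regularized_empirical_risk(alpha, zs, ys, lam, loss, kernel=gaussian_kernel):
    """lam * ||f||_H^2 + (1/n) sum_i loss(y_i, f(z_i)) for f = sum_j alpha_j k(., z_j)."""
    n = len(zs)
    gram = [[kernel(zi, zj) for zj in zs] for zi in zs]
    f_vals = [sum(gram[i][j] * alpha[j] for j in range(n)) for i in range(n)]
    norm_sq = sum(alpha[i] * f_vals[i] for i in range(n))   # alpha' K alpha
    emp_risk = sum(loss(y, f) for y, f in zip(ys, f_vals)) / n
    return lam * norm_sq + emp_risk

zs, ys = [0.0, 1.0], [1.0, 2.0]
print(regularized_empirical_risk([0.0, 0.0], zs, ys, lam=0.1, loss=squared_loss))
print(regularized_empirical_risk([0.5, 1.5], zs, ys, lam=0.1, loss=squared_loss))
```

With all coefficients equal to zero the penalty vanishes and the objective is just the mean loss of predicting 0; nonzero coefficients trade a smaller empirical risk against a larger RKHS norm.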

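For some sequence $\{\lambda_n\}$, define the SVM learning method $\mathfrak{L}$ as the map

$$(\mathcal{Z} \times \mathcal{Y})^n \times \mathcal{Z} \mapsto \mathbb{R} , \qquad (D_0, z) \mapsto f_{D_0,\lambda_n}(z) \tag{2}$$

for all $n \geq 1$. We say that $\mathfrak{L}$ is measurable if it is measurable for all $n$ with respect to the minimal
completion of the product $\sigma$-field on $(\mathcal{Z} \times \mathcal{Y})^n \times \mathcal{Z}$. We say that $\mathfrak{L}$ is ($L$-risk) $P$-consistent if
for all $\varepsilon > 0$

$$\lim_{n \to \infty} P\left(D_0 \in (\mathcal{Z} \times \mathcal{Y})^n : \mathcal{R}_{L,P}(f_{D_0,\lambda_n}) \leq \mathcal{R}^*_{L,P} + \varepsilon\right) = 1 . \tag{3}$$

We say that $\mathfrak{L}$ is universally consistent if for all distributions $P$ on $\mathcal{Z} \times \mathcal{Y}$, $\mathfrak{L}$ is $P$-consistent.
We summarize some known results regarding SVM learning methods.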

Theorem 8. Let $L$ be a convex locally Lipschitz continuous loss function, $L : \mathcal{Z} \times \mathcal{Y} \times \mathbb{R} \mapsto
[0, \infty)$. Let $H$ be a separable RKHS of a bounded measurable kernel on a compact set $\mathcal{Z} \subset \mathbb{R}^d$.
Choose $0 < \lambda_n < 1$ such that $\lambda_n \to 0$, and $\lambda_n^2 n \to \infty$. Then

(a) The empirical SVM decision function $f_{D_0,\lambda_n}$ exists and is unique.

(b) The SVM learning method $\mathfrak{L}$ defined in (2) is measurable.

(c) The $L$-risk $\mathcal{R}_{L,P}(f_{D_0,\lambda_n})$ converges in probability to $\inf_{f \in H} \mathcal{R}_{L,P}(f)$.

(d) If the RKHS $H$ is dense in the set of integrable functions on $\mathcal{Z}$, then the SVM learning method
$\mathfrak{L}$ is universally consistent.

The proof of (a) follows from SC08, Lemma 5.1 and Theorem 5.2. For proof of (b), see SC08,
Lemma 6.23. The proof of (c) follows from SC08, Theorem 6.24. The proof of (d) follows from SC08,
Theorem 5.31, together with Theorem 6.24.

3. SVM for Survival Data without Censoring. In this section we present a few examples 
of the use of SVM for survival data. We discuss the case in which no censoring is introduced. 
We show how different quantities of the conditional distribution of T given Z can be represented 
as Bayes decision functions. We then show how SVM learning methods can be applied to these 
estimation problems and review theoretical properties of these SVM learning methods. In the 
next section we will explain why the standard SVM techniques cannot be employed directly when 
censoring is introduced. 

Let $(Z, T)$ be a random vector where $Z$ is a covariate vector that takes its values in a compact
set $\mathcal{Z} \subset \mathbb{R}^d$, $T$ is a survival time that takes its values in $\mathcal{T} = [0, \tau]$ for some positive constant $\tau$, and
where $(Z, T)$ is distributed according to a probability measure $P$ on $\mathcal{Z} \times \mathcal{T}$.

Note that the conditional expectation $E_P[T \mid Z]$ is the Bayes decision function for the least squares
loss function $L_{LS}$. In other words

$$E_P[T \mid Z] = \arg\min_f E_P[L_{LS}(Z, T, f(Z))] ,$$

where the minimization is taken over all measurable real functions on $\mathcal{Z}$ (see Example 6). Similarly,
the conditional median and the $\alpha$-quantile of $T \mid Z$ can be shown to be the Bayes decision functions
for the absolute deviation function $L_{AD}$ and $L_\alpha$, respectively (see Example 7). In the same manner,
one can represent other quantities of the conditional distribution $T \mid Z$ using Bayes decision functions.

Defining quantities of the survival function as Bayes decision functions is not limited to regression 
(i.e., to a continuous response). Classification problems can also arise in the analysis of survival 
data (see, for example, Ripley and Ripley, 2001; Johnson et al., 2004). For example, let $\rho$, $0 < \rho < \tau$,
be a cutoff constant. Assume that survival to a time greater than $\rho$ is considered as death unrelated
to the disease (i.e., remission) and a survival time less than or equal to $\rho$ is considered as death
resulting from the disease. Denote

$$Y(T) = \begin{cases} 1 & T > \rho \\ -1 & T \leq \rho \end{cases} \tag{4}$$

In that case, the decision function that predicts remission when the probability of $Y = 1$ given
the covariates is greater than 1/2 and failure otherwise is a Bayes decision function for the binary
classification loss $L_{BC}$ of Example 5.

Let $D_0 = \{(Z_1, T_1), \ldots, (Z_n, T_n)\}$ be a data set of $n$ i.i.d. observations distributed according to
$P$. Let $Y_i = Y(T_i)$ where $Y(\cdot) : \mathcal{T} \mapsto \mathcal{Y}$ is some deterministic measurable function. For regression
problems, $Y$ is typically the identity function and for classification $Y$ can be defined, for example,
as in (4). Let $L$ be a convex locally Lipschitz continuous loss function, $L : \mathcal{Z} \times \mathcal{Y} \times \mathbb{R} \mapsto [0, \infty)$.
Note that this includes the loss functions $L_{LS}$, $L_{AD}$, $L_\alpha$, and $L_{HL}$. Define the empirical decision
function as in (1) and the SVM learning method $\mathfrak{L}$ as in (2). Then, it follows from Theorem 8, for
an appropriate RKHS $H$ and regularization sequence $\{\lambda_n\}$, that $\mathfrak{L}$ is measurable and universally
consistent.

4. Censored SVM. In the previous section, we presented a few examples of the use of SVM 
for survival data. In this section we explain why standard SVM techniques cannot be applied 
directly when censoring is introduced. We then explain how to use inverse-probability-of-censoring
weighting (Robins et al., 1994) to obtain a censored SVM learning method. Finally, we show that
the obtained censored SVM learning method is well defined. 

Let $D = \{(Z_1, U_1, \delta_1), \ldots, (Z_n, U_n, \delta_n)\}$ be a set of $n$ i.i.d. random triplets of right censored data
(see Section 2.1). Let $L : \mathcal{Z} \times \mathcal{Y} \times \mathbb{R} \mapsto [0, \infty)$ be a convex locally Lipschitz loss function. Let $H$ be
a separable RKHS of a bounded measurable kernel on $\mathcal{Z}$. We would like to find an empirical SVM
decision function. In other words, we would like to find the minimizer of

$$\lambda \|f\|_H^2 + \frac{1}{n} \sum_{i=1}^n L(Z_i, Y(T_i), f(Z_i)) , \tag{5}$$

where $\lambda > 0$ is a fixed constant, and $Y : \mathcal{T} \mapsto \mathcal{Y}$ is a known function. The problem is that the failure
times $T_i$ may be censored, and thus unknown. While a simple solution is to ignore the censored
observations, it is well known that this can lead to severe bias (Tsiatis, 2006).

In order to avoid this bias, one can reweight the uncensored observations. Note that at time $T_i$,
the $i$-th observation has probability $G(T_i- \mid Z_i) = P(C_i \geq T_i \mid Z_i)$ not to be censored, and thus, one
can use the inverse of the censoring probability for reweighting in (5) (Robins et al., 1994).
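The reason this reweighting removes the bias can be checked in a few lines. The derivation below is a sketch that uses the true $G$ rather than an estimate, Assumption (A2), and the fact that $T \leq \tau$, so that $G(T- \mid Z) > 0$ by (A1); $s$ denotes any fixed real number.

```latex
\begin{align*}
E\left[ \frac{\delta \, L(Z, Y(U), s)}{G(T- \mid Z)} \,\middle|\, Z, T \right]
  &= \frac{L(Z, Y(T), s)}{G(T- \mid Z)} \, P(C \geq T \mid Z, T)
     && \text{since } U = T \text{ when } \delta = 1 \\
  &= \frac{L(Z, Y(T), s)}{G(T- \mid Z)} \, G(T- \mid Z)
     && \text{by (A2)} \\
  &= L(Z, Y(T), s) .
\end{align*}
```

Taking expectations over $(Z, T)$ shows that the weighted loss computed from the censored data has the same risk as the original loss computed from the uncensored failure times.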

More specifically, define the random loss function $L^n : (\mathcal{Z} \times \mathcal{T} \times \{0, 1\})^n \times (\mathcal{Z} \times \mathcal{T} \times \{0, 1\} \times \mathbb{R}) \mapsto \mathbb{R}$
by

$$L^n(D, (z, u, \delta, s)) = \begin{cases} \dfrac{L(z, Y(u), s)}{\hat{G}_n(u \mid z)} & \delta = 1 \\ 0 & \delta = 0 \end{cases}$$

where $\hat{G}_n$ is the estimator of the survival function of the censoring variable based on the set of $n$
random triplets $D$ (see Section 2.1). When $D$ is given, we denote $L^n_D(\cdot) \equiv L^n(D, \cdot)$. Note that the
function $L^n_D$ is no longer random. In order to show that $L^n_D$ is a loss function, we need to show
that $L^n_D$ is a measurable function.
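The data-dependent loss can be sketched as a wrapper around any base loss. In the code below the helper `make_censored_loss`, the toy distributions, and the use of the true censoring survival function `true_g` in place of an estimate are all assumptions for illustration. The simulation compares the average weighted loss, which uses only the observed pairs, with the average loss computed from the normally unobserved failure times.

```python
import math
import random

def make_censored_loss(loss, g_hat, y_of=lambda u: u):
    """Return (z, u, delta, s) -> loss(z, y_of(u), s) / g_hat(u, z) if delta == 1, else 0."""
    def censored_loss(z, u, delta, s):
        if delta == 0:
            return 0.0
        return loss(z, y_of(u), s) / g_hat(u, z)
    return censored_loss

def squared_loss(z, y, s):
    return (y - s) ** 2

# Assumed toy model: T ~ Uniform(0, 2) and C ~ Exp(rate 0.5), independent, no covariates.
def true_g(u, z):
    return math.exp(-0.5 * u)   # P(C >= u) under the toy model

rng = random.Random(0)
loss_c = make_censored_loss(squared_loss, true_g)
s, n = 1.0, 20000
full = weighted = 0.0
for _ in range(n):
    t, c = rng.uniform(0.0, 2.0), rng.expovariate(0.5)
    u, delta = min(t, c), (1 if t <= c else 0)
    full += squared_loss(0.0, t, s) / n        # needs the unobserved failure time
    weighted += loss_c(0.0, u, delta, s) / n   # uses only (u, delta)
print(round(full, 3), round(weighted, 3))
```

The two averages should be close, which is the finite-sample counterpart of the statement that the weighted loss has the same risk as the original loss. In practice `true_g` is unknown and is replaced by an estimator such as those of Examples 1 to 3.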

Lemma 9. Let $L$ be a convex locally Lipschitz loss function. Assume that the estimation procedure
$D \mapsto \hat{G}_n(\cdot \mid \cdot)$ is measurable. Then for every $D \in (\mathcal{Z} \times \mathcal{T} \times \{0, 1\})^n$ the function $L^n_D :
(\mathcal{Z} \times \mathcal{T} \times \{0, 1\}) \times \mathbb{R} \mapsto \mathbb{R}$ is measurable.

Proof. By Remark 4, the function $(u, z) \mapsto 1/\hat{G}_n(u \mid z)$ is well defined. Since by definition,
both $Y$ and $L$ are measurable, we obtain that $(u, z, \delta, s) \mapsto \delta L(z, Y(u), s)/\hat{G}_n(u \mid z)$ is measurable. □

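We define the empirical censored SVM decision function to be

$$f^c_{D,\lambda} = \arg\min_{f \in H} \lambda \|f\|_H^2 + \mathcal{R}_{L^n_D, D}(f) \equiv \arg\min_{f \in H} \lambda \|f\|_H^2 + \frac{1}{n} \sum_{i=1}^n L^n_D(Z_i, U_i, \delta_i, f(Z_i)) . \tag{6}$$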
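For the squared error loss with $Y$ the identity, the minimization in (6) reduces to a weighted kernel least-squares problem. The sketch below is not the authors' implementation: it assumes one-dimensional covariates and a Gaussian kernel, relies on the representer theorem to write $f = \sum_j \alpha_j k(\cdot, Z_j)$, and solves the first-order condition $(WK + n\lambda I)\alpha = Wu$, where $W$ is the diagonal matrix of weights $\delta_i / \hat{G}_n(U_i \mid Z_i)$. The names `censored_ls_svm` and `solve_linear` are hypothetical.

```python
import math

def gaussian_kernel(z1, z2, width=0.5):
    return math.exp(-((z1 - z2) ** 2) / (width ** 2))

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def censored_ls_svm(zs, us, deltas, g_hat, lam, kernel=gaussian_kernel):
    """Minimize lam * ||f||_H^2 + (1/n) sum_i w_i (u_i - f(z_i))^2, w_i = delta_i / g_hat(u_i, z_i)."""
    n = len(zs)
    w = [1.0 / g_hat(u, z) if d == 1 else 0.0 for z, u, d in zip(zs, us, deltas)]
    gram = [[kernel(zi, zj) for zj in zs] for zi in zs]
    # First-order condition for the kernel expansion: (W K + n lam I) alpha = W u
    a = [[w[i] * gram[i][j] + (n * lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    b = [w[i] * us[i] for i in range(n)]
    alpha = solve_linear(a, b)
    return lambda z: sum(alpha[j] * kernel(z, zs[j]) for j in range(n))

zs, us, deltas = [0.0, 0.5, 1.0], [2.0, 1.0, 1.5], [1, 0, 1]
f = censored_ls_svm(zs, us, deltas, g_hat=lambda u, z: 0.8, lam=0.1)
print([round(f(z), 3) for z in zs])
```

Censored observations receive weight zero, so their coefficients vanish, while uncensored observations are up-weighted by the inverse of the estimated censoring survival probability. For the non-smooth losses of Section 2.2 no closed form is available and a convex solver would be needed instead.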

The existence and uniqueness of the empirical censored SVM decision function is ensured by the 
following lemma: 

Lemma 10. Let $L$ be a convex locally Lipschitz loss function. Let $H$ be a separable RKHS of
a bounded measurable kernel on $\mathcal{Z}$. Then there exists a unique empirical censored SVM decision
function.

Proof. Note that given $D$, the loss function $L^n_D(z, u, \delta, \cdot)$ is convex for every fixed $z$, $u$, and $\delta$.
Hence, the result follows from Lemma 5.1 together with Theorem 5.2 of SC08. □

Note that the empirical censored SVM decision function is just the empirical SVM decision
function of (1), after replacing the loss function $L$ with the loss function $L^n_D$. However, there are
two important implications to this replacement. First, empirical censored SVM decision functions
are obtained by minimizing a different loss function for each given data set. Second, the second
expression in the minimization problem (6), namely,

$$\mathcal{R}_{L^n_D, D}(f) \equiv \frac{1}{n} \sum_{i=1}^n L^n_D(Z_i, U_i, \delta_i, f(Z_i)) ,$$

is no longer constructed from a sum of i.i.d. random variables.

We would like to show that the learning method defined by the empirical censored SVM decision 
functions is indeed a learning method. We first define the term learning method for right censored 
data or censored learning method for short. 

Definition 11. A censored learning method on $\mathcal{Z} \times \mathcal{T}$ maps every data set $D \in (\mathcal{Z} \times \mathcal{T} \times
\{0, 1\})^n$, $n \geq 1$, to a function $f_D : \mathcal{Z} \mapsto \mathbb{R}$.

Choose $0 < \lambda_n < 1$ such that $\lambda_n \to 0$. Define the censored SVM learning method $\mathfrak{L}^c$ as
$\mathfrak{L}^c(D) = f^c_{D,\lambda_n}$ for all $n \geq 1$. The measurability of the censored SVM learning method $\mathfrak{L}^c$ is
ensured by the following lemma, which is an adaptation of Lemma 6.23 of SC08 to the censored
case.

Lemma 12. Let $L$ be a convex locally Lipschitz loss function. Let $H$ be a separable RKHS of a
bounded measurable kernel on $\mathcal{Z}$. Assume that the estimation procedure $D \mapsto \hat{G}_n(\cdot \mid \cdot)$ is measurable.
Then the censored SVM learning method $\mathfrak{L}^c$ is measurable, and the map $D \mapsto f^c_{D,\lambda_n}$ is measurable.

Proof. First, by Lemma 2.11 of SC08, for any $f \in H$, the map $(z, u, f) \mapsto L(z, Y(u), f(z))$ is
measurable. The survival function $\hat{G}_n$ is measurable on $(\mathcal{Z} \times \mathbb{R} \times \{0, 1\})^n \times (\mathcal{Z} \times \mathbb{R})$ and by Remark 4,
the function $D \mapsto \delta_i / \hat{G}_n(u_i \mid z_i)$ is well defined and measurable. Hence
$D \mapsto n^{-1} \sum_{i=1}^n \delta_i L(z_i, Y(u_i), f(z_i)) / \hat{G}_n(u_i \mid z_i)$
is measurable. Note that the map $f \mapsto \lambda_n \|f\|_H^2$ where $f \in H$ is also measurable. Hence we obtain
that the map $\phi : (\mathcal{Z} \times \mathcal{T} \times \{0, 1\})^n \times H \mapsto \mathbb{R}$, defined by

$$\phi(D, f) = \lambda_n \|f\|_H^2 + \mathcal{R}_{L^n_D, D}(f) ,$$

is measurable. By Lemma 10, $f^c_{D,\lambda_n}$ is the only element of $H$ satisfying

$$\phi(D, f^c_{D,\lambda_n}) = \inf_{f \in H} \phi(D, f) .$$

By Aumann's measurable selection principle (SC08, Lemma A.3.18), the map $D \mapsto f^c_{D,\lambda_n}$ is measurable
with respect to the minimal completion of the product $\sigma$-field on $(\mathcal{Z} \times \mathcal{T} \times \{0, 1\})^n$. Since
the evaluation map $(f, z) \mapsto f(z)$ is measurable (SC08, Lemma 2.11), the map $(D, z) \mapsto f^c_{D,\lambda_n}(z)$
is also measurable. □

5. Theoretical Results. In the following, we discuss some theoretical results regarding the
censored SVM learning method, proposed in Section 4. In Section 5.1 we discuss finite sample 
bounds. In Section 5.2 we discuss consistency. Learning rates are discussed in Section 5.3. Finally, 
censoring model misspecification is discussed in Section 5.4. 

5.1. Finite Sample Bounds. Define $f_{P,\lambda} = \arg\min_{f \in H} \lambda \|f\|_H^2 + \mathcal{R}_{L,P}(f)$ and define the approximation
error

$$A_2(\lambda) = \lambda \|f_{P,\lambda}\|_H^2 + \mathcal{R}_{L,P}(f_{P,\lambda}) - \inf_{f \in H} \mathcal{R}_{L,P}(f) . \tag{7}$$

Define the censoring estimation error

$$\mathrm{Err}_n(t, z) = \hat{G}_n(t \mid z) - G(t \mid z)$$

to be the difference between the estimated and true survival functions of the censoring variable.
Let $H$ be an RKHS over the covariates space $\mathcal{Z}$, where we assume throughout this section that
$\mathcal{Z}$ is a compact subset of $\mathbb{R}^d$. Let $B_H = \{f \in H : \|f\|_H \leq 1\}$ be the unit ball in the RKHS $H$.
Denote $\mathcal{N}(B_H, \|\cdot\|_\infty, \varepsilon)$ to be the $\varepsilon$-covering number of $B_H$ with respect to the norm $\|\cdot\|_\infty$, i.e.,
the minimum number of sup-norm $\varepsilon$-balls that cover $B_H$.

We are now ready to establish a finite-sample bound for the generalization of censored SVM 
learning methods: 

Theorem 13. Let $L : \mathcal{Z} \times \mathcal{Y} \times \mathbb{R} \mapsto [0, \infty)$ be a convex, locally Lipschitz continuous loss
function satisfying $L(z, y, 0) \leq 1$ for all $(z, y) \in \mathcal{Z} \times \mathcal{Y}$. Let $H$ be a separable RKHS over $\mathcal{Z}$ with
continuous kernel $k$ satisfying $\|k\|_\infty \leq 1$. Let $(Z, T, C)$ be distributed according to $P$. Let $\hat{G}_n(t \mid Z)$
be an estimator of the survival function of the censoring variable and assume (A1)-(A2). Then,
for any fixed regularization constant $\lambda > 0$, $n \geq 1$, $\varepsilon > 0$, and $\eta > 0$, with probability not less than



K 



(ci ((KA)-V2) (i^A)-i/2 + 1) / 2r? + 21og(2AA(5j,,||.||oo,(KA)i/2e)) \\Errn\\oo 
+ K n + ~f^ 

The proof appears in Appendix A.l. We remark that the bound above depends on the distribution 
P through the constant K and the error terms A2 and Errn- 

For the Kaplan-Meier estimator (see Example 1), under some conditions, bounds of the random
error $\|\mathrm{Err}_n\|_\infty$ were established (Bitouze et al., 1999). In this case we can replace the bound
of Theorem 13 with a more explicit one.

More specifically, let $\hat{G}_n$ be the Kaplan-Meier estimator. Then, for every $n \geq 1$ and $\varepsilon > 0$ the
following Dvoretzky-Kiefer-Wolfowitz-type inequality holds (Bitouze et al., 1999, Theorem 2):

$$P(\|\hat{G}_n - G\|_\infty > \varepsilon) \leq \frac{5}{2} \exp\{-2 n K_0^2 \varepsilon^2 + C_0 \sqrt{n} K_0 \varepsilon\} , \tag{8}$$

where $K_0 = P(T > \tau)$ is a lower bound on the survival function at $\tau$, and where $C_0$ is some
universal constant (see Wellner, 2007, for a bound on $C_0$). Fix $\eta > 0$ and write



Some algebraic manipulations then yield 



(9) P(||G'n-G||oo> ^^5:° )<^e- 



As a result, we obtain the following corollary: 



Corollary 14. Consider the setup of Theorem 13. Assume that the censoring variable $C$ is
independent of both $T$ and $Z$. Let $\hat{G}_n$ be the Kaplan-Meier estimator of $G$. Then for any fixed
regularization constant $\lambda$, $n \geq 1$, $\varepsilon > 0$, and $\eta > 0$, with probability not less than 1 — \e.~'^ ,

,aIIh + ^l,p(/E,,a)- inf 71l,p(/)< 
K 



[CL [{K\)-'l^) (KXyy^ + 1) / 2r^ + 2log{2M{BH,\\-\UiKXy/^e)) ^ + Co 
K 1 V n 2KKq^ 



5.2. $\mathcal{P}$-universal Consistency. In this section we discuss consistency of the censored SVM learning
method proposed in Section 4. In general, $P$-consistency means that (3) holds for all $\varepsilon > 0$.
Universal consistency means that the learning method is $P$-consistent for every probability measure
$P$ on $\mathcal{Z} \times \mathcal{T} \times \{0, 1\}$. In the following we discuss a more restrictive notion than universal consistency,
namely $\mathcal{P}$-universal consistency. Here, $\mathcal{P}$ is the set of all probability distributions for which there
is a constant $K$ such that conditions (A1)-(A2) hold. We say that a censored learning method
is $\mathcal{P}$-universally consistent if (3) holds for all $P \in \mathcal{P}$. We note that when the first assumption is
violated for a set of covariates $\mathcal{Z}_0$ with positive probability, there is no hope of learning the optimal
function for all $z \in \mathcal{Z}$, unless some strong assumptions on the model are enforced. The second
assumption is required for proving consistency of the learning method $\mathfrak{L}^c$ proposed in Section 4.
However, it is possible that other censored learning techniques will be able to achieve consistency 
for a larger set of probability measures. 

In order to show $\mathcal{P}$-universal consistency, we utilize the bound given in Theorem 13. This bound
depends on four different terms: the approximation error $A_2$, the entropy of the ball $B_H$, the
(local) Lipschitz constant $c_L$, and the error in the estimation of the survival function
$G$. We need the following assumptions:

(B1) $H$ is a separable RKHS over $\mathcal{Z}$ with universal kernel $k$ satisfying $\|k\|_\infty \leq 1$ (see Remark 15
below for definition of a universal kernel).
(B2) There are constants $a > 1$ and $p > 0$, such that for every $\varepsilon > 0$, the entropy of $B_H$ is bounded
as follows:

$$\log \mathcal{N}(B_H, \|\cdot\|_\infty, \varepsilon) \leq a \varepsilon^{-2p} . \tag{10}$$

(B3) There is a constant $q \geq 0$, such that the (locally) Lipschitz constant is bounded by $c_L(\lambda) \leq c \lambda^q$
for all $\lambda$ large enough.
(B4) $\hat{G}_n$ is consistent for $G$ and there is a finite constant $s > 0$ such that $P(\|\mathrm{Err}_n\|_\infty > b n^{-1/s}) \to 0$
for any $b > 0$.

Before we state the main result regarding $\mathcal{P}$-universal consistency, we present some examples for
which the assumptions above hold: 

Remark 15. A continuous kernel $k$ whose corresponding RKHS $H$ is dense in the class of
continuous functions over $\mathcal{Z}$ is called universal. Examples of universal kernels include the Gaussian
kernels and the Taylor kernels. For more details, the reader is referred to SC08, Chapter 4-6. Recall
that $\mathcal{R}^*_{L,P} = \inf_f \mathcal{R}_{L,P}(f)$, where the infimum is taken over all measurable functions $f$. For universal
kernels, we have $\inf_{f \in H} \mathcal{R}_{L,P}(f) = \mathcal{R}^*_{L,P}$ (SC08, Corollary 5.29).

Remark 16. The entropy bound (10) of Assumption (B2) is satisfied for both Taylor and
Gaussian kernels for all $p > 0$ (see SC08, Section 6.4).

Remark 17. Assumption (B3) holds with $q = 1$ for $L_{LS}$, and with $q = 0$ for $L_{HL}$, $L_{AD}$,
and $L_\alpha$ (see Section 2.2 for the definitions of the loss functions).

Remark 18. Assume that $\hat G_n$ is consistent for $G$. When $\hat G_n$ is the Kaplan-Meier estimator
(see Example 1), then Assumption (B4) holds for all $s > 2$ (Bitouze et al., 1999, Theorem 3).
Similarly, when $\hat G_n$ is the proportional hazard estimator (see Example 2), under some conditions,
Assumption (B4) holds for all $s > 2$ (see Goldberg and Kosorok, 2011, Theorem 3.2 and its
conditions). When $\hat G_n$ is the generalized Kaplan-Meier estimator (see Example 3), then under some
conditions, Assumption (B4) holds for all $s > d/2 + 2$, where $d$ is the dimension of the covariate
space (see Dabrowska, 1989, Corollary 2.2 and its conditions).

Now we are ready for the main result: 

Theorem 19. Let $L : \mathcal{Z} \times \mathcal{Y} \times \mathbb{R} \to [0,\infty)$ be a convex, locally Lipschitz continuous loss function
satisfying $L(z,y,0) \le 1$ for all $(z,y) \in \mathcal{Z} \times \mathcal{Y}$. Assume conditions (B1)-(B4). Let $\lambda_n \to 0$, where
$0 < \lambda_n < 1$ and $\lambda_n^{(q+1)/2} n^{1/\min\{2(p+1),\,s\}} \to \infty$, where $q$ and $s$ are as defined in Assumptions (B3) and
(B4), respectively. Then $\mathfrak{L}^c$ is $\mathcal{P}$-universally consistent.
Proof. Fix $0 < \lambda < 1$. Denote $c_\lambda = c_L((K\lambda)^{-1/2})$ and note that by Assumption (B3), $c_\lambda \vee 1 \le
c_1 \lambda^{-q/2}$ for some constant $c_1$. By Theorem 13, equation (10), and the fact that $\sqrt{x + y} \le \sqrt{x} + \sqrt{y}$
for any positive constants $x$ and $y$, we obtain

with probability at least $1 - e^{-\eta}$. By Lemma A.1.5 of SC08 (see also page 228),

/ 9 „ \ 1/2 / 9 \ / 9n \ l/(2p+2) / ^ \ l/(2p+2) 

.n.n2(A-A)V...(^) ((A-A)^/^- + D (|) (^) £ 3 (f 

where the minimizer is given by 



.2/ V n y 

By Assumption (B4) and the fact that $\|Err_n\|_\infty \le 1$, there exists a constant $C_2 = C_2(\eta)$ that
depends only on $\eta$, such that for all $n \ge 1$,

This inequality, together with the bound on $c_\lambda$ given above, yields that with probability of at least
$1 - 2e^{-\eta}$,

M\rD,xfH + T^LAfh,x)- l^tT^LAf) 



< ^2(A) + 3 - + M + 



K3/2 \ \n J \n J K 

Assumption (B1), together with Lemma 5.15 of SC08, yields that $A_2(\lambda) \to 0$ as $\lambda \to 0$. By assumption,
$\lambda_n^{(q+1)/2} n^{1/\min\{2(p+1),\,s\}} \to \infty$, which ensures that the second expression in the RHS of (12) converges
to zero, which implies (3). Since (3) holds for every $P \in \mathcal{P}$, we obtain $\mathcal{P}$-universal consistency. □

5.3. Learning Rates. In the previous section we discussed $\mathcal{P}$-universal consistency, which
ensures that for every probability measure $P \in \mathcal{P}$, the learning method asymptotically learns the optimal
function. In this section we would like to study learning rates.

We define learning rates for censored learning methods similarly to the definition for regular 
learning methods (see SC08, Definition 6.5): 
Definition 20. Let $L : \mathcal{Z} \times \mathcal{Y} \times \mathbb{R} \to [0,\infty)$ be a loss function. Let $P \in \mathcal{P}$ be a distribution. We
say that a censored learning method learns with a rate $\{\varepsilon_n\}_n$, where $\{\varepsilon_n\} \subset (0,1]$ is a sequence
decreasing to 0, if for some constant $c_P > 0$, all $n \ge 1$, and all $\eta \in [0,\infty)$, there exists a constant
$c_\eta \in [1,\infty)$ that depends on $\eta$ and $\{\varepsilon_n\}$ but not on $P$, such that

$P\big(D \in (\mathcal{Z} \times \mathcal{T} \times \{0,1\})^n : \mathcal{R}_{L,P}(f^c_{D,\lambda}) \le \mathcal{R}^*_{L,P} + c_P c_\eta \varepsilon_n\big) \ge 1 - e^{-\eta}$.

In order to study the learning rates, we need an additional assumption:
(B5) There exist constants $C_3$ and $\beta \in (0,1]$ such that

$A_2(\lambda) \le C_3 \lambda^{\beta}, \quad \lambda > 0$,

where $A_2$ is the approximation error function defined in (7).

Lemma 21. Let $L : \mathcal{Z} \times \mathcal{Y} \times \mathbb{R} \to [0,\infty)$ be a convex, locally Lipschitz continuous loss
function satisfying $L(z,y,0) \le 1$ for all $(z,y) \in \mathcal{Z} \times \mathcal{Y}$. Assume conditions (B1)-(B5). Then the
learning rate of $\mathfrak{L}^c$ is given by $n^{-\frac{\beta}{(2\beta+q+1)\min\{p+1,\,s/2\}}}$, where $p$, $q$, $s$, and $\beta$ are as defined in
Assumptions (B2), (B3), (B4), and (B5), respectively.

Proof. By Assumption (B5) and Eq. (11), 



i^3/2 \ \n ) \n ) K 

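where $a$ is defined in Assumption (B2), and $C_2 = C_2(\eta)$ depends on $\eta$ but not on $n$ (see proof of
Theorem 19). Choose $\lambda_n$ to be a sequence that behaves like $n^{-\frac{1}{(2\beta+q+1)\min\{p+1,\,s/2\}}}$. Then, it follows
from (12) that

$P\Big(\mathcal{R}_{L,P}(f^c_{D,\lambda_n}) - \inf_f \mathcal{R}_{L,P}(f) \le c_P\big(\sqrt{\eta} + C_2(\eta)\big)\, n^{-\frac{\beta}{(2\beta+q+1)\min\{p+1,\,s/2\}}}\Big) \ge 1 - 2e^{-\eta}$,

where $c_P$ is independent of $\eta$. □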

5.4. Misspecified Censoring Model. In the previous subsection we showed that under conditions
(B1)-(B4), the censored SVM learning method $\mathfrak{L}^c$ is $\mathcal{P}$-universally consistent. While one can choose
the Hilbert space $H$ and the loss function $L$ in advance such that conditions (B1)-(B3) hold,
condition (B4) need not hold when the censoring mechanism is misspecified. In the following, we
consider this case.

Let $\hat G_n(t|z)$ be the estimator of the survival function for the censoring variable. The deviation of
$\hat G_n(t|z)$ from the true survival function $G(t|z)$ can be divided into two terms. The first term is the
deviation of the estimator from its limit, while the second term is the difference between
the estimator limit and the true survival function. When the model is correctly specified, and the
estimator is consistent, the second term vanishes. More formally, let $G_P(t|z)$ be the limit of the
estimator under the probability measure $P$, and assume it exists. Define the errors

$Err_n(t,z) = Err_{n1}(t,z) + Err_2(t,z) = \big(\hat G_n(t|z) - G_P(t|z)\big) + \big(G_P(t|z) - G(t|z)\big)$.

Note that $Err_{n1}$ is a random function that depends on the data, the estimation procedure, and
the probability measure $P$, while $Err_2$ is a fixed function that depends only on the estimation
procedure and the probability measure $P$.

In the following, we would like to find finite sample bounds when the censoring model is misspecified.
This will be done by refining the proof of Theorem 13. First, we need to introduce the concept
of clipping. We say that a loss function $L$ can be clipped at $M > 0$ if, for all $(z,y,s) \in \mathcal{Z} \times \mathcal{Y} \times \mathbb{R}$,

$L(z,y,\bar s) \le L(z,y,s)$,

where $\bar s$ denotes the clipped value of $s$ at $\pm M$, that is,

$\bar s = -M$ if $s < -M$,
$\bar s = s$ if $-M \le s \le M$,
$\bar s = M$ if $s > M$

(see SC08, Definition 2.22). The loss functions $L_{HL}$, $L_{LS}$, $L_{AD}$, and $L_\alpha$ can be clipped at some
$M$ when $\mathcal{Y} = \mathcal{T}$ or $\mathcal{Y} = \{-1, 1\}$ (SC08, Chapter 2). Moreover, we have the following criterion for
clipping which is usually easy to check.
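The clipping condition can be checked numerically on a grid. The following is a minimal sketch, not part of the original implementation: the function names, the grid, and the choice $M = 2$ are illustrative assumptions, and the loss is written as a function of $(y, s)$ only because the absolute deviation loss does not depend on $z$.

```python
def clip(s: float, m: float) -> float:
    """Clip s to the interval [-m, m]."""
    return max(-m, min(m, s))


def absolute_deviation_loss(y: float, s: float) -> float:
    """Absolute deviation loss |y - s| (z is omitted: this loss ignores it)."""
    return abs(y - s)


def clipping_never_hurts(loss, m, ys, ss, tol=1e-12):
    """Check loss(y, clip(s)) <= loss(y, s) on a finite grid.

    This is only a numerical sanity check on the grid, not a proof.
    """
    return all(loss(y, clip(s, m)) <= loss(y, s) + tol for y in ys for s in ss)


# Hypothetical grid: responses in [-2, 2], predictions in [-10, 10].
responses = [i / 10 for i in range(-20, 21)]
predictions = [i / 4 for i in range(-40, 41)]

# Responses inside [-M, M]: clipping the prediction never increases the loss.
ok_inside = clipping_never_hurts(absolute_deviation_loss, 2.0, responses, predictions)

# A response outside [-M, M] violates the condition for M = 2.
ok_outside = clipping_never_hurts(absolute_deviation_loss, 2.0, [5.0], predictions)
```

The second check illustrates why the response space has to be bounded for a fixed clipping level to work.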



Lemma 22. Let $\mathcal{Z}$ and $\mathcal{Y}$ be compact. Let $L : \mathcal{Z} \times \mathcal{Y} \times \mathbb{R} \to [0,\infty)$ be continuous and strictly
convex, with bounded minimizer for every $(z,y) \in \mathcal{Z} \times \mathcal{Y}$. Then $L$ can be clipped at some $M$.
See proof in Appendix A.3.

For a loss function $L$ that can be clipped at $M$, let

(13) $W = \max_{z \in \mathcal{Z},\, y \in \mathcal{Y},\, s \in [-M,M]} L(z,y,s)$.

For a function $f$, we define $\bar f$ to be the clipped version of $f$, i.e., $\bar f = \max\{-M, \min\{M, f\}\}$.

Theorem 23. Let $L : \mathcal{Z} \times \mathcal{Y} \times \mathbb{R} \to [0,\infty)$ be a convex, locally Lipschitz continuous loss function
satisfying $L(z,y,0) \le 1$ for all $(z,y) \in \mathcal{Z} \times \mathcal{Y}$ and that can be clipped at $M > 0$, and let the constant
$W$ be defined by (13). Let $G_P = \lim \hat G_n$, and assume that condition (A1) holds for both $G$ and $G_P$.
Assume also condition (A2) holds. Let $W_0 = \|L(z,t,f_{P,\lambda}(z))\|_\infty$. Then for any fixed regularization
constant X > 0, n > 1, e > 0, and rj > 0, with probability not less than 1 — e~^^, 

rnhjl + Kl,p(/S,a) - jjfj'^i.pV) < 



K ' ' K QKn 



{cL ((i^A)-i/2) (KA)-V2 + 1) 27^ + 21og {2MiBH, \\ ■ ||oo, (KXy/^e)) 
+ K V n 

The proof appears in Appendix A. 2. 

As a consequence of Theorem 23, we can prove that even under misspecification of the censored
data model, the censored learning method $\mathfrak{L}^c$ achieves the optimal risk, up to a constant that
depends on $E_P(G_P - G)$, which is the expected distance between the limit of the estimator and the true
distribution. If the estimator estimates reasonably well, one can hope that this term is small, even
under misspecification.

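Corollary 24. Assume that the conditions of Theorem 23 hold. Assume also (B1)-(B3) and
that $P(\|\hat G_n - G_P\|_\infty > b n^{-1/s}) \to 0$ for any $b > 0$. Let $\lambda_n \to 0$, where $0 < \lambda_n < 1$, and
$\lambda_n^{(q+1)/2} n^{1/\min\{2(p+1),\,s\}} \to \infty$, where $q$ is defined in Assumption (B3). Then for every $\varepsilon > 0$, and
every $P \in \mathcal{P}$,

$\lim_{n \to \infty} P\big(D \in (\mathcal{Z} \times \mathcal{T} \times \{0,1\})^n : \mathcal{R}_{L,P}(f^c_{D,\lambda_n}) \le \mathcal{R}^*_{L,P} + |E_P(G_P - G)| + \varepsilon\big) = 1$.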
The proof is similar to the proof of Theorem 19 and is therefore omitted.
We now show that the additional condition

(14) $P(\|\hat G_n - G_P\|_\infty > b n^{-1/s}) \to 0$

of Corollary 24 holds for both the Kaplan-Meier estimator and the Cox model estimator.

Example 25 (Kaplan-Meier estimator). Let $\hat G_n$ be the Kaplan-Meier estimator of $G$. Let $G_P$
be the limit of $\hat G_n$. Note that $G_P$ is the marginal distribution of the censoring variable. It follows
from (9) that condition (14) holds for all $s > 2$.

Example 26 (Cox model estimator). Let $\hat G_n$ be the estimator of $G$ when the Cox model is
assumed (see Example 2). Let $G_P$ be the limit of $\hat G_n$. It was shown that the limit $G_P$ exists,
regardless of the correctness of the proportional hazard model (Goldberg and Kosorok, 2011). Moreover,
for all $\varepsilon > 0$, and all $n$ large enough,

$P(\|\hat G_n - G_P\|_\infty > \varepsilon) \le \exp\{-W_1 n \varepsilon^2 + W_2 \sqrt{n}\,\varepsilon\}$,

where $W_1$, $W_2$ are universal constants that depend on the set $\mathcal{Z}$, the variance of $Z$, the constants
$K$ and $K_0$, but otherwise do not depend on the distribution $P$ (see Goldberg and Kosorok, 2011,
Theorem 3.2, and conditions therein). Fix $\eta > 0$ and write

^ V?? + W| + W2 
^ ~ 2Wi^/n 

Some algebraic manipulations then yield 

hmsupP G„ - Gp 00 > „r r- — <6 • 

n^oo V W\^n ) 

Hence, condition (14) holds for all s > 2. 

6. Simulation Study. In this section we illustrate the use of the censored SVM learning
method proposed in Section 4 via a simulation study. We consider five different data generating
mechanisms, including one-dimensional and multidimensional settings, and different types of
censoring mechanisms. We compute the censored SVM decision function with respect to the absolute
deviation loss function $L_{AD}$. For this loss function, the Bayes risk is given by the conditional
median (see Example 7). We choose to compute the conditional median and not the conditional mean,
since censoring prevents reliable estimation of the unrestricted mean survival time when no further
assumptions on the tail of the distribution are made (see discussion in Karrison, 1997; Zucker, 1998;
Chen and Tsiatis, 2001). We compare the results of the SVM approach to the results obtained by
the Cox model and to the Bayes risk. We test the effects of ignoring the censored observations.
Finally, for multidimensional examples, we also check the benefit of variable selection.

Fig 1. Weibull failure time, proportional hazard (Setting 1): The true conditional median (solid blue), the SVM
decision function (dashed red), and the Cox regression median (dot-dashed green) are plotted for samples of size
n = 50, 100, 200, 400 and 800. The censoring percentage is given for each sample size. An observed failure time is
represented by an x, and an observed censoring time is represented by an o.

Fig 2. Weibull failure time, proportional hazard (Setting 1): Distribution of the risk for different sizes of data set, for
standard SVM that ignores the censored observations (Ignore), for censored SVM (Censored), and for Cox regression
median (Cox). Bayes risk is denoted by a black dashed line. Each box plot is based on 100 repetitions of the simulation
for each size of data set.

The algorithm presented in Section 4 was implemented in the Matlab environment. For the
implementation we used the Spider library for Matlab (which can be downloaded from
http://www.kyb.tuebingen.mpg.de/bs/people/spider/). The Matlab code for both the algorithm
and the simulations can be found in Supplement A. The distribution of the censoring variable
was estimated using the Kaplan-Meier estimator (see Example 1). We used the Gaussian RBF
kernel $k_\sigma(x_1,x_2) = \exp(-\sigma^{-2}\|x_1 - x_2\|_2^2)$, where the width of the kernel $\sigma$ was chosen using
cross-validation. Instead of minimizing the regularized problem (6), we solve the equivalent problem (see
SC08, Chapter 5):

Minimize $\mathcal{R}_{L^n_D,D}(f)$ under the constraint $\|f\|_H^2 \le \lambda^{-1}$,

where $H$ is the RKHS with respect to the kernel $k_\sigma$, and $\lambda$ is some constant chosen using
cross-validation. Note that there is no need to compute the norm of the function $f$ in the RKHS space
$H$ explicitly. The norm can be obtained using the kernel matrix $K$ with coefficients $k_{ij} = k(Z_i,Z_j)$
(see SC08, Chapter 11). The risk of the estimated functions was computed numerically, using a
randomly generated data set of size 10000.
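To illustrate how the RKHS norm can be read off the kernel matrix, here is a minimal Python sketch (the paper's own code is in Matlab and is not reproduced here). It assumes the Gaussian RBF kernel in the form $\exp(-\|x_1 - x_2\|^2/\sigma^2)$ and a function of the form $f = \sum_i \alpha_i k(\cdot, Z_i)$, for which $\|f\|_H^2 = \alpha' K \alpha$. All function names and the toy inputs are hypothetical.

```python
import math


def gaussian_kernel(x1, x2, sigma):
    """Gaussian RBF kernel exp(-||x1 - x2||^2 / sigma^2) (assumed parametrization)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-squared_distance / sigma ** 2)


def kernel_matrix(points, sigma):
    """Kernel matrix K with entries k_ij = k(Z_i, Z_j)."""
    return [[gaussian_kernel(p, q, sigma) for q in points] for p in points]


def rkhs_norm_squared(alpha, K):
    """Squared RKHS norm alpha' K alpha of f = sum_i alpha_i k(., Z_i)."""
    n = len(alpha)
    return sum(alpha[i] * K[i][j] * alpha[j] for i in range(n) for j in range(n))


# Toy example: two one-dimensional covariates and coefficients (1, -1).
Z = [(0.0,), (1.0,)]
K = kernel_matrix(Z, sigma=1.0)
norm_sq = rkhs_norm_squared([1.0, -1.0], K)  # equals 2 - 2 * exp(-1)
```

With this identity, a constraint on the squared norm becomes a quadratic constraint on the coefficient vector, so the function $f$ never has to be handled outside its kernel expansion.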

In some simulations the failure time is distributed according to the Weibull distribution (Lawless, 
2003). The density of the Weibull distribution is given by 

p \pj 

where $\kappa > 0$ is the shape parameter and $\rho > 0$ is the scale parameter. Assume that $\kappa$ is fixed and
that $\rho = \exp(\beta_0 + \beta'Z)$, where $\beta_0$ is a constant, $\beta$ is some coefficient vector, and $Z$ is the covariate
vector. In this case, the failure time distribution follows the proportional hazard assumption, i.e.,
the hazard rate is given by $h(t|Z) = \exp(\beta_0 + \beta'Z)\,d\Lambda(t)$, where $\Lambda(t) = t^{\kappa}$. When the proportional
hazard assumption holds, estimation based on Cox regression is consistent and efficient (see
Example 2; note that the distribution discussed there is that of the censoring variable and not the failure
time; nevertheless, the estimation is similar). Thus, when the failure-time distribution follows the
proportional hazard assumption, we use the Cox regression as a benchmark.

Fig 3. Weibull failure time, non-linear proportional hazard (Setting 2): The true conditional median (solid blue), the
SVM decision function (dashed red), and the Cox regression median (dot-dashed green) are plotted for samples of size
n = 50, 100, 200, 400 and 800. The censoring percentage is given for each sample size. An observed failure time is
represented by an x, and an observed censoring time is represented by an o.
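As a rough illustration of Weibull failure times with a covariate-dependent scale, the sketch below samples by inverting the survival function and computes the conditional median. It assumes the standard shape-scale parametrization, with survival function $\exp(-(t/\rho)^{\kappa})$ and median $\rho(\log 2)^{1/\kappa}$, and a log-linear link for the scale. The paper's simulations may use a different convention, and all names and coefficient values here are hypothetical.

```python
import math
import random


def weibull_median(shape, scale):
    """Median of a Weibull distribution with survival exp(-(t / scale) ** shape)."""
    return scale * math.log(2.0) ** (1.0 / shape)


def sample_weibull(shape, scale, rng):
    """Draw one Weibull variate by inverting the survival function."""
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is finite
    return scale * (-math.log(u)) ** (1.0 / shape)


def conditional_median(z, beta0, beta, shape):
    """Conditional median when scale = exp(beta0 + beta'z) (assumed link)."""
    scale = math.exp(beta0 + sum(b * zi for b, zi in zip(beta, z)))
    return weibull_median(shape, scale)


# Sanity check: the empirical median of a large sample is close to the formula.
rng = random.Random(12345)
draws = sorted(sample_weibull(2.0, 1.0, rng) for _ in range(20001))
empirical_median = draws[10000]
theoretical_median = weibull_median(2.0, 1.0)
```

Under these assumptions the true conditional median is available in closed form, which is what makes it usable as a reference curve when assessing an estimated decision function.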

In the first setting, the covariates $Z$ are generated uniformly on the segment $[-1,1]$. The failure
time follows the Weibull distribution with shape parameter 2, and scale parameter $-0.5Z$. Note
that the proportional hazard assumption holds. The censoring variable $C$ is distributed uniformly
on the segment $[0, c_0]$, where the constant $c_0$ is chosen such that the mean censoring percentage is
30%. We used 5-fold cross-validation to choose the kernel width and the regularization constant
among the set of pairs

$(\lambda^{-1}, \sigma) = (0.1 \cdot 10^i,\ 0.05 \cdot 2^j), \quad i,j \in \{0,1,2,3\}$.
Fig 4. Weibull failure time, non-linear proportional hazard (Setting 2): Distribution of the risk for different sizes of
data set, for standard SVM that ignores the censored observations (Ignore), for censored SVM (Censored), and for
Cox regression median (Cox). Bayes risk is denoted by a black dashed line. Each box plot is based on 100 repetitions
of the simulation for each size of data set.

Fig 5. Multidimensional Weibull failure time (Setting 3): Distribution of the risk for different data set sizes, for
standard SVM that ignores the censored observations (Ignore), for censored SVM (Censored), for censored SVM with
variable selection (VS), and for Cox regression median (Cox). Bayes risk is denoted by a black dashed line. Each box
plot is based on 100 repetitions of the simulation for each size of data set.

Fig 6. Multidimensional Weibull failure time, non-linear proportional hazard (Setting 4): Distribution of the risk
for different data set sizes, for standard SVM that ignores the censored observations (Ignore), for censored SVM
(Censored), for censored SVM with variable selection (VS), and for Cox regression median (Cox). Bayes risk is
denoted by a black dashed line. Each box plot is based on 100 repetitions of the simulation for each given data set
size.



We repeated the simulation 100 times for each of the sample sizes 50, 100, 200, 400, and 800.

In Figure 1, the conditional medians obtained by the censored SVM learning method and by Cox
regression are plotted. The true median is plotted as a reference. In Figure 2, we compare the risk
of the SVM method to that of the median of the survival function obtained by the Cox regression (to
which we refer as the Cox regression median). We also checked the effect of ignoring the censored
observations by computing the standard SVM decision function for the data set in which all the
censored observations were deleted. Both figures show that even though the SVM does not use
the proportional hazard assumption for estimation, the results are comparable to those of Cox
regression, especially for larger sample sizes. Figure 2 also shows that there is a non-negligible
price for ignoring the censored observations.
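The difference between weighting and simply deleting censored observations can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a covariate-free Kaplan-Meier estimate of the censoring survival function, no ties between observed times, and the absolute deviation loss. All function names and the toy data are hypothetical.

```python
def km_censoring_survival(times, deltas):
    """Kaplan-Meier estimate of P(C > t), treating censoring (delta == 0) as the event.

    Assumes no ties among the observed times.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    n = len(times)
    steps = []  # (censoring time, survival value just after it)
    surv = 1.0
    for rank, i in enumerate(order):
        at_risk = n - rank
        if deltas[i] == 0:
            surv *= 1.0 - 1.0 / at_risk
            steps.append((times[i], surv))

    def G(t):
        value = 1.0
        for time, s in steps:
            if time > t:
                break
            value = s
        return value

    return G


def ipcw_absolute_risk(times, deltas, preds, G):
    """Empirical risk (1/n) * sum_i delta_i * |U_i - f(Z_i)| / G(U_i)."""
    total = sum(abs(u - f) / G(u) for u, d, f in zip(times, deltas, preds) if d == 1)
    return total / len(times)


def complete_case_absolute_risk(times, deltas, preds):
    """Average loss over uncensored observations only (censored ones are dropped)."""
    losses = [abs(u - f) for u, d, f in zip(times, deltas, preds) if d == 1]
    return sum(losses) / len(losses)


# Toy data: the second observation is censored.
U = [1.0, 2.0, 3.0, 4.0]
delta = [1, 0, 1, 1]
G_hat = km_censoring_survival(U, delta)
weighted = ipcw_absolute_risk(U, delta, [0.0] * 4, G_hat)
naive = complete_case_absolute_risk(U, delta, [0.0] * 4)
```

In the toy data, the weighted risk gives more weight to the failure times observed after the censoring time, whereas the complete-case average treats all uncensored observations equally.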

The second setting differs from the first setting only in the failure time distribution. In the second
setting the failure time distribution follows the Weibull distribution with scale parameter $-0.5Z^2$.
Note that the proportional hazard assumption holds for $Z^2$, but not for the original covariate $Z$. In
Figure 3, the true, the SVM and the Cox regression median are plotted. In Figure 4, we compare
the risk of SVM to that of the Cox regression. Both figures show that in this case SVM does
better than the Cox regression. Figure 4 also shows the price of ignoring the censored observations.

Fig 7. Step function median, Weibull censoring time (Setting 5): The true conditional median (solid blue), the SVM
decision function using the Kaplan-Meier estimator for the censoring (dashed red), the SVM decision function using
the Cox estimator for the censoring (dotted magenta), and the Cox regression median (dot-dashed green) are plotted
for samples of size n = 50, 100, 200, 400 and 800. The censoring percentage is given for each sample size. An observed
failure time is represented by an x, and an observed censoring time is represented by an o.

The third and fourth settings are generalizations of the first two, respectively, to 10 dimensions.
The covariates $Z$ are generated uniformly on $[-1,1]^{10}$. The failure time follows the Weibull
distribution with shape parameter 2. The scale parameters of the third and fourth settings are
$-0.5Z_1 + 2Z_2 - Z_3$ and $-0.5(Z_1)^2 + 2(Z_2)^2 - (Z_3)^2$, respectively. Note that these models are sparse,
namely, they depend only on the first three variables. The censoring variable $C$ is distributed
uniformly on the segment $[0, c_0]$, where the constant $c_0$ is chosen such that the mean censoring
percentage is 40%. We used 5-fold cross-validation to choose the kernel width and the regularization
constant among the set of pairs

$(\lambda^{-1}, \sigma) = (0.1 \cdot 10^i,\ 0.2 \cdot 2^j), \quad i,j \in \{0,1,2,3\}$.
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Fig 8. Step function median, Weibull censoring time (Setting 5): Distnbution of the risk for different sizes of data set, 
for standard SVM that ignores the censored observations (Ignore), for censored SVM wtth Kaplan-Meier estimator for 
the censoring (Misspecified) , for censored SVM with Cox estimator for the censoring (True), and for Cox regression 
median (Cox). The Bayes risk is denoted by a black dashed line. Each box plot is based on 100 repetitions of the 
simulation for each size of data set. 



The results for the third and the fourth settings appear in Figure 5 and Figure 6, respectively.
We compare the risk of standard SVM that ignores censored observations, censored SVM, censored
SVM with variable selection, and the Cox regression. We performed variable selection for censored
SVM based on recursive feature elimination as in Guyon et al. (2002, Section 2.6). When the
proportional hazard assumption holds (setting 3), SVM performs reasonably well, although the
Cox model performs better, as expected. When the proportional hazard assumption fails to hold
(setting 4), SVM performs better, and it seems that the risk of the Cox regression converges, but not
to the Bayes risk (see Example 26 for discussion). Both figures show that variable selection achieves
a slightly smaller median risk at the price of higher variance, and that ignoring the censored
observations leads to higher risk.

In the fifth setting, we consider a non-smooth conditional median. We also investigate the
influence of using a misspecified model for the censoring mechanism. The covariates $Z$ are
generated uniformly on the segment $[-1,1]$. The failure time is normally distributed with expectation
$3 + 3 \cdot 1\{Z < 0\}$ and variance 1. Note that the proportional hazard assumption does not hold for
the failure time. The censoring variable $C$ follows the Weibull distribution with shape parameter
2, and scale parameter $-0.5Z + \log(6)$, which results in a mean censoring percentage of 40%. Note
that for this model, the censoring is independent of the failure time only given the covariate $Z$ (see
Assumption (A2)). Estimation of the censoring distribution using the Kaplan-Meier estimator corresponds
to estimation under a misspecified model. Since the censoring follows the proportional hazard
assumption, estimation using the Cox estimator corresponds to estimation under the true model. We
use 5-fold cross-validation to choose the regularization constant and the width of the kernel, as in
setting 1.

In Figure 7, the conditional medians obtained by the censored SVM learning method using both the
misspecified and true model for the censoring, and by Cox regression, are plotted. The true median
is plotted as a reference. In Figure 8, we compare the risk of the SVM method using both the
misspecified and true model for the censoring. We also checked the effect of ignoring the censored
observations. Both figures show that in general SVM does better than the Cox model, regardless
of the censoring estimation. The difference between the misspecified and true model for the censoring
is small and the corresponding curves in Figure 7 almost coincide. Figure 8 shows again that there
is a non-negligible price for ignoring the censored observations.

7. Concluding Remarks. We studied an SVM framework for right censored data. We pro- 
posed a general censored SVM learning method and showed that it is well defined and measurable. 
We derived finite sample bounds on the deviation from the optimal risk. We proved risk consis- 
tency and computed learning rates. We discussed misspecification of the censoring model. Finally, 
we performed a simulation study to demonstrate the censored SVM method. 

We believe that this work illustrates an important approach for applying support vector machines 
to right censored data, and to missing data in general. However, many open questions remain and 
many possible generalizations exist. First, we assumed that the censoring is independent of the 
failure time given the covariates, and the probability that no censoring occurs is positive given 
the covariates. It should be interesting to study the consequences of violation of one or both 
assumptions. Second, we have used the inverse-probability-of-censoring weighting to correct the
bias induced by censoring. In general, this is not always the most efficient way of handling missing 
data (see, for example, van der Vaart, 2000, Chapter 25.5). Third, we discussed only right-censored 
data and not general missing mechanisms. We believe that further developing SVM techniques, 
that are able to better utilize the data and to perform under weaker assumptions in more general 
settings, is of great interest. 

APPENDIX A: PROOFS 

A.1. Proof of Theorem 13.

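Proof. Note that

$\lambda\|f^c_{D,\lambda}\|_H^2 + \mathcal{R}_{L^n_D,D}(f^c_{D,\lambda}) \le \lambda\|f_{P,\lambda}\|_H^2 + \mathcal{R}_{L^n_D,D}(f_{P,\lambda})$,

where $\mathcal{R}_{L^n_D,D}(f) = \mathbb{P}_n \delta L(Z,U,f(Z))/\hat G_n(U|Z)$. Hence,

$$\begin{aligned}
&\lambda\|f^c_{D,\lambda}\|_H^2 + \mathcal{R}_{L,P}(f^c_{D,\lambda}) - \inf_f \mathcal{R}_{L,P}(f) - A_2(\lambda) \\
&\quad= \lambda\|f^c_{D,\lambda}\|_H^2 + \mathcal{R}_{L,P}(f^c_{D,\lambda}) - \lambda\|f_{P,\lambda}\|_H^2 - \mathcal{R}_{L,P}(f_{P,\lambda}) \\
&\quad= \mathcal{R}_{L,P}(f^c_{D,\lambda}) - \mathcal{R}_{L^n_D,D}(f^c_{D,\lambda}) \\
&\qquad+ \lambda\|f^c_{D,\lambda}\|_H^2 + \mathcal{R}_{L^n_D,D}(f^c_{D,\lambda}) - \lambda\|f_{P,\lambda}\|_H^2 - \mathcal{R}_{L^n_D,D}(f_{P,\lambda}) \\
&\qquad+ \mathcal{R}_{L^n_D,D}(f_{P,\lambda}) - \mathcal{R}_{L,P}(f_{P,\lambda}) \\
&\quad\le \mathcal{R}_{L,P}(f^c_{D,\lambda}) - \mathcal{R}_{L^n_D,D}(f^c_{D,\lambda}) + \mathcal{R}_{L^n_D,D}(f_{P,\lambda}) - \mathcal{R}_{L,P}(f_{P,\lambda}) \\
&\quad= \mathcal{R}_{L,P}(f^c_{D,\lambda}) - \mathcal{R}_{L_G,D}(f^c_{D,\lambda}) + \mathcal{R}_{L_G,D}(f^c_{D,\lambda}) - \mathcal{R}_{L^n_D,D}(f^c_{D,\lambda}) \\
&\qquad+ \mathcal{R}_{L^n_D,D}(f_{P,\lambda}) - \mathcal{R}_{L_G,D}(f_{P,\lambda}) + \mathcal{R}_{L_G,D}(f_{P,\lambda}) - \mathcal{R}_{L,P}(f_{P,\lambda}),
\end{aligned} \tag{15}$$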

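where

$\mathcal{R}_{L_G,D}(f) = \mathbb{P}_n L_G(Z,U,\delta,f(Z)) = \mathbb{P}_n \delta L(Z,Y,f(Z))/G(T|Z)$,

i.e., $\mathcal{R}_{L_G,D}$ is the empirical loss function with the true censoring distribution function. Using
conditional expectation, we obtain that for every $f \in H$,

$$\mathcal{R}_{L,P}(f) = E_P[L(Z,Y,f(Z))] = E_P\left[E_P\left[\frac{\delta}{G(T|Z)}\,L(Z,Y,f(Z)) \,\Big|\, Z,T\right]\right] = E_P[L_G(Z,U,\delta,f(Z))] = \mathcal{R}_{L_G,P}(f).$$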
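We conclude that

$$\begin{aligned}
&\lambda\|f^c_{D,\lambda}\|_H^2 + \mathcal{R}_{L,P}(f^c_{D,\lambda}) - \inf_f \mathcal{R}_{L,P}(f) - A_2(\lambda) \\
&\quad\le \big(|\mathcal{R}_{L_G,P}(f^c_{D,\lambda}) - \mathcal{R}_{L_G,D}(f^c_{D,\lambda})| + |\mathcal{R}_{L_G,P}(f_{P,\lambda}) - \mathcal{R}_{L_G,D}(f_{P,\lambda})|\big) \\
&\qquad+ \big(|\mathcal{R}_{L_G,D}(f^c_{D,\lambda}) - \mathcal{R}_{L^n_D,D}(f^c_{D,\lambda})| + |\mathcal{R}_{L^n_D,D}(f_{P,\lambda}) - \mathcal{R}_{L_G,D}(f_{P,\lambda})|\big).
\end{aligned} \tag{16}$$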

We would like to replace the bound above with a bound that does not depend on the functions
$f^c_{D,\lambda}$ and $f_{P,\lambda}$. To do that we first bound the norm of both $f_{P,\lambda}$ and $f^c_{D,\lambda}$. We start with $f_{P,\lambda}$. Since
$L$ is a convex locally Lipschitz continuous loss function, we obtain that $f_{P,\lambda}$ exists (see Section 3).
Since

(17) $\lambda\|f_{P,\lambda}\|_H^2 \le \lambda\|f_{P,\lambda}\|_H^2 + \mathcal{R}_{L,P}(f_{P,\lambda}) = \inf_{f \in H} \lambda\|f\|_H^2 + \mathcal{R}_{L,P}(f) \le \mathcal{R}_{L,P}(0)$,

and $L(z,y,0) \le 1$, we conclude that $\|f_{P,\lambda}\|_H \le \lambda^{-1/2}$. Moreover, since $\|k\|_\infty \le 1$, it follows from
Lemma 4.23 of SC08 that $\|f_{P,\lambda}\|_\infty \le \lambda^{-1/2}$. In addition, for $f \in \lambda^{-1/2}B_H$, where $B_H$ is the unit
ball of $H$, we have

$|L(z,y,f(z))| \le |L(z,y,f(z)) - L(z,y,0)| + L(z,y,0) \le c_L(\lambda^{-1/2})\lambda^{-1/2} + 1$.

Following the same arguments, this time for the empirical distribution induced by $D$, and using the
fact that $L^n_D(z,y,0) \le K^{-1}$ (see Remark 4), we conclude that $\|f^c_{D,\lambda}\|_H \le (K\lambda)^{-1/2}$. In addition,
for every $f \in (K\lambda)^{-1/2}B_H$, we have

$$\begin{aligned}
|L^n_D(z,y,f(z))| &\le \frac{1}{K}\big(|L(z,y,f(z)) - L(z,y,0)| + L(z,y,0)\big) \\
&\le \frac{1}{K}\big(c_L((K\lambda)^{-1/2})(K\lambda)^{-1/2} + 1\big).
\end{aligned} \tag{18}$$

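Note that since $\mathcal{Z}$ is compact, the closure of $B_H$ in $\ell_\infty(\mathcal{Z})$ is also compact (SC08, Corollary 4.31).
This compactness ensures that $\mathcal{N}(B_H, \|\cdot\|_\infty, \varepsilon) < \infty$ for every $\varepsilon > 0$. Let $\mathcal{F}_\varepsilon$ be an $\varepsilon$-net of
$(K\lambda)^{-1/2}B_H$ with cardinality

$|\mathcal{F}_\varepsilon| = \mathcal{N}((K\lambda)^{-1/2}B_H, \|\cdot\|_\infty, \varepsilon) = \mathcal{N}(B_H, \|\cdot\|_\infty, (K\lambda)^{1/2}\varepsilon)$.

Note that for every $f \in (K\lambda)^{-1/2}B_H$ there is a $g = g(f) \in \mathcal{F}_\varepsilon$ such that $\|f - g\|_\infty \le \varepsilon$. Hence,
using the local Lipschitz continuity of $L$ and $L^n_D$, and the fact that $c_{L^n_D}(a) \le K^{-1}c_L(a)$ for every
$a > 0$, we obtain for every $f \in (K\lambda)^{-1/2}B_H$

$$\begin{aligned}
|\mathcal{R}_{L_G,P}(f) - \mathcal{R}_{L_G,D}(f)|
&\le |\mathcal{R}_{L_G,P}(f) - \mathcal{R}_{L_G,P}(g)| + |\mathcal{R}_{L_G,P}(g) - \mathcal{R}_{L_G,D}(g)| + |\mathcal{R}_{L_G,D}(f) - \mathcal{R}_{L_G,D}(g)| \\
&\le 2\varepsilon K^{-1} c_L((K\lambda)^{-1/2}) + |\mathcal{R}_{L_G,P}(g) - \mathcal{R}_{L_G,D}(g)|.
\end{aligned}$$

Taking suprema on both sides, we obtain

$\sup_{f \in (K\lambda)^{-1/2}B_H} |\mathcal{R}_{L_G,P}(f) - \mathcal{R}_{L_G,D}(f)| \le 2\varepsilon K^{-1} c_L((K\lambda)^{-1/2}) + \sup_{g \in \mathcal{F}_\varepsilon} |\mathcal{R}_{L_G,P}(g) - \mathcal{R}_{L_G,D}(g)|$.

Combining the above bounds, we are ready to bound the first part of (16). For all $\eta' > 0$,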



(19) 



P {[^LaAfh^) - T^LaMfh,x)\ + \T^LaAfp,x) " '^LaMfp,x)\ > + AeK-'cL{{KXy 



1/ 



< P sup iTZLaAa) - T^LaM9)\ > B\. ^ 



< ^ P ( lULaAa) - nLaM9)\ > ^Vln 



<2MiBHA\-\\oo,iKX)'/h)e~^', 

where the last inequality follows from Hoeffding's inequality (SC08, Theorem 6.10). Write
$\eta = \eta' - \log 2\mathcal{N}(B_H, \|\cdot\|_\infty, (K\lambda)^{1/2}\varepsilon)$, and obtain that



(20) \TlLaAf'b,x) - T^LaAf'b,x)\ + \T^LaAfp,x) " ^L,p(/p,; 



^ ^j2, + 21og(2^(i..,|M|^,(KA)V%)) ^ ,^^-.^^((^,)-v.) 



n 



with probability not more than $e^{-\eta}$.

We are now ready to bound the second expression of (16). For every $f \in (K\lambda)^{-1/2}B_H$,

6L{Z,Y,f{Z)) ^ 5L{Z,YJiZ)) 



(21) 



< 



G{T\Z) 
L{Z,YJ{Z)) 



G{T\Z)Gn{T\Z) 



Gn{T\Z) 
||G(r|Z)-G„,(T|Z)|| 



B . 
where the last inequality follows from (18) and Assumption (A1). Consequently, we obtain

\'^LaMfh,x) - '^Ll,D{fh,x)\ + |7^LS,d(/p,a) - nLa,D{fP,x)\ < • 

Combining this bound with (19), and applying additional algebraic calculations, yields the asser- 
tion. □ 

A.2. Proof of Theorem 23.

Proof. The proof is based on utilizing (13) to replace the bound for -j^ obtained in (21) of 
Theorem 13 with a bound on f£)x- Note that since L can be clipped at M, for every f £ H we 
have 

Hence, we can obtain a similar bound to (15): 

A||/^,aIIh +^l,p(/d,a) - inf ^L,p(/) - ^2(A) 

= M\fh,xfH + ^l,p(/d,a) - A||/p,a||1, - ^l,p(/p,a) 

<nLArD,x)-T^LiMrD,x) 

+ T^L-Mfp,x)-T^LAfp,x) 

< (^L,p(/d,a) -7^L£,d(7B,a)) + (^L£,d(/p,a) -7^L,p(/p,A)) 
= -^n + Bn ■ 

We start by bounding An- Note that 

An = {TZLaAfD^x) - T^LoMhx)) + [^La,D{h,x) - T^Ll,DirD,x)) ■ 

Starting with the first expression of An, note that by (17) and the arguments that follow, ||/p,a||oo ^ 
A^^/^ < (KX)^^^"^. Repeating the arguments that lead to (20), together with Lemma 27, we obtain 
that for any / G {K\)-^''^Bh 

(23) 



l-p (h -v ^;m. i?j 2r? + 21og(2AA(ij^,||-|U,(i^A)V%)) ^ ^ ,,^.._i/2. 
with probability not more than $e^{-\eta}$, where $B = c_L((K\lambda)^{-1/2})(K\lambda)^{-1/2} + 1$. As for the second
expression in $A_n$, note that for any clipped function $\bar f$, we have



\nLaAf)-T^L^hAf)\< 



5L{Z,Y,f{Z)) 5L{Z,Y,f{Z)) 
G{T\Z) " Gp{T\Z) 



(24) 



5L{Z,YJ{Z)) ^ 5L{Z,YJ{Z)) 



< 



W 
■210 



Gp{T\Z) Gn{T\Z) 

\Gp-Gn\\oo + \^n{Gp-G){T\Z)\] , 



where the last inequality follows from condition (A1).
We now bound $B_n$. Note that

<(7^LS,D(/P,A)) - 7^L2„Z)(/|.,A)) 

The second and third components of $B_n$ can be bounded by (23) and (24), respectively. We now
bound the first expression of $B_n$. Define $h_L(f)(Z,T) = L(Z,T,f(Z)) - L(Z,T,f^*_P(Z))$, where $f^*_P \in
\arg\min_f \mathcal{R}_{L,P}(f)$, and where the minimum is taken over all measurable functions $f$:

6L{Z, T, fp,x{Z)) - 6L{Z, T, hxi^)) 



T^L1^,D{fp,x)) - ■^L2,,d(/p,a) 



< 



GniZ\T) 

L{Z,TJp^x{Z))-L{Z,T,rp^^{Z)) 
hL{fp,x) - hUrpx)) ■ 



K 

It follows from Bernstein's inequality (see Eq. (7.41) of SC08, for details) that with probability not
less than $1 - e^{-\eta}$,



hL{fp,x) - hLifp^,) ) < |P (hUfp,x) - hUfp^x)) + ^ 



P ( hL{fp,x) - hLifp^x) ) = ^l,p(/p,a) - nLAfp,x) < T^LAfp,x) - jnJ7^L,p(/) < ^2(A) . 



Note that 



Summarizing 

(25) nLiMfp,^)) - T^L-Mfh) ^ K 

Combining (23), (24), and (25) yields the assertion. 



2^2 (A) TWqt] 
6Kn 



□ 
A.3. Additional Proofs.

Proof of Lemma 22. Define

$g(z,y) = \inf_s L(z,y,s), \quad (z,y,s) \in \mathcal{Z} \times \mathcal{Y} \times \mathbb{R}$.

By assumption, for every $(z,y) \in \mathcal{Z} \times \mathcal{Y}$, $\inf_s L(z,y,s)$ is obtained at some point $s = s(z,y)$.
Moreover, since $L$ is strictly convex, $s(z,y)$ is uniquely defined. Below, for a point $(z_0,y_0)$, we write
$m_0 = g(z_0,y_0)$ and $s_0 = s(z_0,y_0)$.

We now show that $g(z,y)$ is continuous at a general point $(z_0,y_0)$. Let $\{(z_n,y_n)\}$ be any sequence
that converges to $(z_0,y_0)$. Let $m_n = g(z_n,y_n) = L(z_n,y_n,s_n)$, and assume by contradiction that
$m_n$ does not converge to $m_0$. Since $g(z,y)$ is bounded from above by $\max_{z,y} L(z,y,0)$, and $\mathcal{Z} \times \mathcal{Y}$
is compact, there is a subsequence $\{m_{n_k}\}$ that converges to some $m^* \ne m_0$. By the continuity
of $L$, there is a further subsequence $\{s_{n_{k_l}}\} \subset (-\infty,\infty)$ such that $L(z_{n_{k_l}}, y_{n_{k_l}}, s_{n_{k_l}}) = m_{n_{k_l}}$ and
$\{s_{n_{k_l}}\}$ converges to $s^* \in [-\infty,\infty]$. If $s^* \in (-\infty,\infty)$, then by definition $m_0 = \inf_s L(z_0,y_0,s) <
L(z_0,y_0,s^*) = m^*$, and hence from the continuity of $L$, for all $l$ large enough, $L(z_{n_{k_l}}, y_{n_{k_l}}, s_0) <
L(z_{n_{k_l}}, y_{n_{k_l}}, s_{n_{k_l}})$, and we arrive at a contradiction.

Assume now that $s^* \notin (-\infty,\infty)$, and without loss of generality, let $s^* = \infty$. Note that $\max_{z,y} g(z,y)$
is bounded from above by $M_0 = \max_{z,y} L(z,y,0)$. Choose $s_M > s_0$ such that for all $s > s_M$,
$L(z_0,y_0,s) > 3M_0$. By the continuity of $L$, there is an $\varepsilon > 0$ such that for all $(z,y) \in B_\varepsilon(z_0,y_0) \cap
\mathcal{Z} \times \mathcal{Y}$, $L(z,y,s_M) > 2M_0$, and note that $L(z,y,s_0) < 2M_0$. Recall that $L$ is strictly convex in the
last variable, and hence it must be increasing at $s_M$ for all points $(z,y) \in B_\varepsilon(z_0,y_0) \cap \mathcal{Z} \times \mathcal{Y}$ (see
for example Niculescu and Persson, 2006, Proposition 1.3.5). Consequently, for all $l$ big enough,
$L(z_{n_{k_l}}, y_{n_{k_l}}, s_{n_{k_l}}) > 2M_0$, and we arrive at a contradiction, since $m_{n_{k_l}} \le M_0$.

We now show that $s(z,y) = \arg\min_s L(z,y,s)$ is continuous at a general point $(z_0,y_0)$. Let
$\{(z_n,y_n)\}$ be a sequence that converges to $(z_0,y_0)$. Let $s_n = s(z_n,y_n)$. Assume, by contradiction,
that $s_n$ does not converge to $s_0$. Hence, there is a subsequence $\{s_{n_k}\}$ that converges to some $s^* \in
(-\infty,\infty)$ with $s^* \ne s_0$ ($s^* \in \{-\infty,\infty\}$ cannot happen, see above). Hence, $\lim L(z_{n_k}, y_{n_k}, s_{n_k}) = \lim m_{n_k} = m_0$,
and $L(z_0,y_0,s_0) = L(z_0,y_0,s^*) = m_0$, which contradicts the fact that $L(z_0,y_0,\cdot)$ is strictly convex
and therefore has a unique minimizer.

Since $s(z,y)$ is continuous on a compact set, there is an $M$ such that $|s(z,y)| \le M$. It then
follows from Lemma 2.23 of SC08 that $L$ can be clipped at $M$. □
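Lemma 27. Let $\mathcal{F}$ be a function space and denote $\bar{\mathcal{F}} = \{\bar f : f \in \mathcal{F}\}$. Then

$\mathcal{N}(\bar{\mathcal{F}}, \|\cdot\|, \varepsilon) \le \mathcal{N}(\mathcal{F}, \|\cdot\|, \varepsilon)$.

Proof. Assume that $\mathcal{N}(\mathcal{F}, \|\cdot\|, \varepsilon) = K < \infty$, otherwise the assertion is trivial. Let $f_1, \ldots, f_K$
be such that $\mathcal{F} \subset \bigcup_{k=1}^K B(f_k, \varepsilon)$, where $B(f,\varepsilon)$ is the $\varepsilon$-ball with respect to the norm $\|\cdot\|$, centered
at $f$. Note that since $|\bar s - \bar t| \le |s - t|$ for all $s,t \in \mathbb{R}$, we have $\|\bar f_1 - \bar f_2\| \le \|f_1 - f_2\|$ and hence
$\bar{\mathcal{F}} \subset \bigcup_{k=1}^K B(\bar f_k, \varepsilon)$. The result follows. □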

SUPPLEMENTARY MATERIAL 

Supplement A: Matlab Code 

Please read the file README.pdf for details on the files in this folder.